Build Simulation (Operator Toolbox)
Synopsis
This operator allows you to build a new ExampleSet with similar statistical properties to a reference ExampleSet.
Description
This operator extracts the statistical properties from a reference ExampleSet (e.g. mean and standard deviation) and then builds a new one with the same statistical distribution as the input.
The sample size of the new simulated ExampleSet can be specified, as well as the algorithm used to create the new values for the simulated Examples. The operator also provides the SimulationModel used to create the simulated ExampleSet as an object at the mod output port. It can be stored and reused to create further simulated ExampleSets with the same properties by connecting it to the mod input port of a Build Simulation operator.
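The fit-once, sample-many idea behind the mod port can be sketched as follows. This is a minimal illustration in Python, not the operator's implementation: the class IndependentNormalModel and its methods are hypothetical names, the data is made up, and the use of the sample standard deviation (n-1 denominator) is an assumption.

```python
import random
import statistics


class IndependentNormalModel:
    """Hypothetical stand-in for a SimulationModel (not the operator's class)."""

    def __init__(self, params):
        # params maps attribute name -> (mean, standard deviation)
        self.params = params

    @classmethod
    def fit(cls, rows):
        # rows: list of dicts with numeric values that all share the same keys
        params = {}
        for name in rows[0]:
            column = [row[name] for row in rows]
            # Assumption: sample standard deviation (n - 1 denominator)
            params[name] = (statistics.mean(column), statistics.stdev(column))
        return cls(params)

    def sample(self, sample_size, seed=None):
        rng = random.Random(seed)
        # x = (r * s) + m, with r drawn from a standard normal distribution
        return [
            {name: rng.gauss(0.0, 1.0) * s + m
             for name, (m, s) in self.params.items()}
            for _ in range(sample_size)
        ]


reference = [
    {"num1": 1.0, "num2": 10.0},
    {"num1": 2.0, "num2": 14.0},
    {"num1": 3.0, "num2": 18.0},
]
model = IndependentNormalModel.fit(reference)  # statistics are extracted once
first = model.sample(5, seed=1)
second = model.sample(1000, seed=2)            # reuses the stored statistics
```

The second call to sample does not touch the reference data again, which mirrors what happens when a stored SimulationModel is connected to the mod input port.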
This operator also allows the user to set attributes of the generated data to constant values. This can be done in two ways: either with the constant_attributes parameter, which allows constant attributes to be defined manually, or by providing an ExampleSet on the constant input port that lists the attributes and their values. The second way is generally preferable if many attributes are supposed to be constant.
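The two ways of defining constants can be pictured with a small Python sketch. It assumes that constants are simply written over the generated values, which the text above does not confirm; the function names, attribute names, and data are hypothetical.

```python
def constants_from_table(table, name_attribute, value_attribute):
    """Turn a table with one row per constant attribute into a name -> value map."""
    return {row[name_attribute]: row[value_attribute] for row in table}


def apply_constants(rows, constants):
    """Return copies of the simulated rows with the constant attributes overwritten."""
    return [{**row, **constants} for row in rows]


simulated = [
    {"age": 31.2, "passenger_class": 1.7},
    {"age": 24.9, "passenger_class": 2.4},
]

# Way 1: a manual definition, comparable to the constant_attributes parameter
manual = {"passenger_class": 3}

# Way 2: a table of names and values, comparable to an ExampleSet on the
# constant port read via the name_attribute and value_attribute parameters
table = [{"name": "passenger_class", "value": 3}]
from_table = constants_from_table(table, "name", "value")

result = apply_constants(simulated, from_table)
```

Both ways end in the same mapping; the table form scales better because it can be produced by another process instead of being typed in by hand.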
Input
- exa (Data Table)
The reference ExampleSet.
- mod
A SimulationModel. If connected, the operator does not fit a new simulation model but uses the provided one.
- constant (Data Table)
An ExampleSet with the information on constant attributes. Which of its attributes hold the attribute names and the constant values can be set with the name_attribute and value_attribute parameters.
Output
- exa (Data Table)
The simulated ExampleSet.
- mod
The SimulationModel, which can be used in another Build Simulation operator to avoid refitting the model. If the mod input is connected, the provided simulation model is passed through; otherwise the newly fitted one is delivered.
- ori (Data Table)
The original ExampleSet.
Parameters
- sample_size The number of simulated rows desired for the output. Range:
- algorithm
This parameter allows you to select the algorithm used. It has the following options:
- normal_distribution: With this setting the operator assumes that the attributes in the reference ExampleSet are statistically independent of one another and that each follows its own normal distribution. The mean and the standard deviation of each input attribute are computed, and then each new value x is built using the formula: x = (r*s)+m where r is a normally distributed random number, s the standard deviation, and m the mean of the respective attribute.
- correlated_normal_distribution: With this setting all attribute values are derived from a multi-dimensional, correlated normal distribution. Each new row X in the simulated ExampleSet is built using the formula: X = (R*L)+m where R is a row with normally distributed random data, L is the matrix obtained from the Cholesky decomposition of the covariance matrix, and m is the row of attribute means.
- empirical_distribution: With this setting the operator uses a probability distribution derived from the observed data without making any assumptions about the functional form of the population distribution that the data come from. Every attribute is assumed to be independent of the others, so an independent distribution is fitted for each of them. For details on the implementation see: http://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/random/EmpiricalDistribution.html
- constant_attributes Allows you to manually specify constant attributes and their respective values. Range:
- name_attribute If the constant port is connected, the attribute names and values are provided by an ExampleSet. This parameter defines which attribute of that ExampleSet contains the names of the attributes that are to be constant. Range:
- value_attribute If the constant port is connected, the attribute names and values are provided by an ExampleSet. This parameter defines which attribute of that ExampleSet contains the values of the attributes that are to be constant. Range:
- use_local_random_seed This parameter indicates if a local random seed should be used. Range:
- local_random_seed If the use_local_random_seed parameter is checked, this parameter determines the local random seed. Range:
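The correlated_normal_distribution option described under the algorithm parameter can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the operator's code: the function names and data are hypothetical, the covariance uses the n-1 denominator, and the factor is taken as the lower-triangular matrix L with covariance = L * L^T, so that a row is generated as X = R * L^T + m. Whether the operator stores L or its transpose is not stated in this documentation.

```python
import math
import random


def mean_and_covariance(rows):
    """Column means and sample covariance matrix of a list of numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]
    return means, cov


def cholesky(a):
    """Lower-triangular L with a = L * L^T (a must be symmetric positive definite)."""
    d = len(a)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(a[i][i] - s)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def simulate(rows, sample_size, seed=None):
    """Draw rows from a multivariate normal fitted to the reference rows."""
    rng = random.Random(seed)  # plays the role of a local random seed
    means, cov = mean_and_covariance(rows)
    lower = cholesky(cov)
    d = len(means)
    simulated = []
    for _ in range(sample_size):
        r = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # x_i = m_i + sum_k L[i][k] * r[k], the row form of X = R * L^T + m
        simulated.append(
            [means[i] + sum(lower[i][k] * r[k] for k in range(i + 1))
             for i in range(d)]
        )
    return simulated


reference = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]]
new_rows = simulate(reference, sample_size=100, seed=42)
```

Setting every off-diagonal covariance to zero reduces this to the normal_distribution option, where each attribute is generated as x = (r*s)+m on its own.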
Tutorial Processes
Build Simulation on statistically independent attributes
In this tutorial process there is a sample ExampleSet (a CSV inside the Create ExampleSet operator) that contains five statistically independent, normally distributed attributes. These attributes (num1, num2, num3, num4, and num5) are used by the Build Simulation operator to build five new attributes (sim1, sim2, sim3, sim4, and sim5) that each have the same mean and standard deviation as their original counterparts (i.e. the mean and standard deviation of num2 and sim2 are equal, but not equal to those in num/sim3, num/sim4, or num/sim5).
Build Simulation on correlated attributes
In this tutorial process there is a sample five-dimensional, normally distributed ExampleSet (a CSV inside the Create ExampleSet operator). These attributes (num1, num2, num3, num4, and num5) are used by the Build Simulation operator to build a new data set with five new attributes (sim1, sim2, sim3, sim4, and sim5) that will preserve the same mean and standard deviation as the original data set.
Reuse of the simulation model
In this tutorial process we first generate random data. Afterwards we build a simulation model on it. The simulation model is then used in a second Build Simulation operator. This reduces the runtime of the second Build Simulation operator significantly.
Generate new Third Class Titanic Passengers
In this tutorial process we want to generate new passengers for the Titanic. All of the newly generated Titanic passengers should come from the Third Class.
To do this we first remove the nominal sex attribute and convert the passenger class into a numerical value (1,2,3). We then use the Build Simulation operator to generate new data points. The constant_attributes parameter is used to generate only passengers of the Third Class.
This process also shows the limitations of the methods used. You may encounter attributes such as "No of Siblings or Spouses on Board" with negative or floating-point values. This is an artifact of the generation method and may need to be corrected afterwards.
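One possible correction for such count attributes is to round the simulated values and clip them at zero. The sketch below is only one option and is not part of the tutorial process; the function name, the attribute name siblings_spouses, and the values are hypothetical.

```python
def repair_count_attribute(rows, name, minimum=0):
    """Round a simulated count to the nearest integer and clip it at a lower bound."""
    repaired = []
    for row in rows:
        fixed = dict(row)  # copy, so the simulated rows stay unchanged
        fixed[name] = max(minimum, round(row[name]))
        repaired.append(fixed)
    return repaired


simulated = [
    {"siblings_spouses": -0.4},
    {"siblings_spouses": 1.6},
    {"siblings_spouses": 0.2},
]
cleaned = repair_count_attribute(simulated, "siblings_spouses")
```

Rounding and clipping changes the mean and standard deviation of the attribute slightly, so the corrected data no longer matches the fitted distribution exactly.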